Cricket IPL Score Prediction - Linear Regression - Regression
Demo
Live Web App Demo Link
Deployment on Heroku: http://mlcricketscoreprediction.herokuapp.com/
Abstract
The purpose of this report is to use the IPL.csv data to predict the score of a cricket match.
This can be used to gain insight into how and why a cricket score varies depending on wickets, runs, and so on. It can also serve as a model for gaining a marketing advantage, for example by targeting advertisements based on which team is more likely to win and which has a greater chance of losing. Score prediction is a regression problem: using the match data, the model predicts what range of score a team can achieve.
This project dives into predicting cricket scores with machine learning concepts. "End to end" means it is a step-by-step process: it starts with data collection and EDA, moves through data preparation (cleaning and transforming), then selecting, training, and saving ML models, cross-validation and hyper-parameter tuning, and finally developing a web service and deploying it. This repository contains the code for predicting cricket scores using various Python libraries.
It uses the pandas, sklearn, datetime, and pickle libraries, each of which provides one particular piece of functionality. pandas objects rely heavily on NumPy objects. sklearn offers a large number of models. "Pickling" is the process whereby a Python object hierarchy is converted into a byte stream. The datetime module supplies classes for working with dates and times. The purpose of creating this repository is to gain insight into a complete ML project; using these libraries in practice deepened my knowledge of them and grew my ML repository. The screenshot above and the Video_File folder will help you understand the flow of the output.
Motivation
The reason behind building this is that until now I had worked on individual concepts, so I wanted to combine everything I have learnt so far and create an end-to-end project that shows the whole life cycle of an ML project. In addition, as an employee of a company, being able to carry out an entire process on my own is an essential skill. Building an end-to-end project gave me a wholesome approach to handling the given data. Hence, I continue to gain knowledge while practicing, and spread my literary wings in tech-heaven.
Acknowledgment
Dataset Available: https://www.kaggle.com/yuvrajdagur/ipl-dataset-season-2008-to-2017
The Data
Here are all the features included in the data set, and a short description of them all.
It has 76014 total observations and 15 columns.
This data has been cleaned of any null values.
Analysis of the Data
Let’s start by doing a general analysis of the data as a whole, including all the features the Linear Regression algorithm will be using.
Basic Statistics
Graphing of Features
Graph Set 1
Modelling
Math behind the metrics
Linear Regression is a predictive algorithm that models a linear relationship between the prediction (call it 'Y') and the input (call it 'X'). As we know from basic maths, if we plot 'X' against 'Y', a linear relationship always appears as a straight line, as when we plot a graph of such values.
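To make the straight-line idea concrete, here is a minimal sketch (using toy values made up for illustration, not the IPL data) that fits a line to points generated from y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 2x + 1.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])

model = LinearRegression()
model.fit(X, y)

print(model.coef_[0])     # slope b -> 2.0
print(model.intercept_)   # intercept a -> 1.0
```

Because the points are perfectly linear, the fitted slope and intercept recover the generating line exactly.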
Model Architecture Process Through Visualization
Linear Regression Architecture:
Quick Notes
Step 1: Imported essential libraries.
Step 2: Loaded the dataset.
Step 3: Performed Data Cleaning.
- Removed unwanted columns.
- Kept only consistent teams.
- Removed the first 5 overs of data from every match.
- Converted the 'date' column from string into a datetime object.
Step 4: Performed Data Pre-processing.
- Converted categorical features using one-hot encoding via the .get_dummies() function.
- Re-arranged the columns.
Step 5: Split the data.
- Removed ‘date’ column.
Step 6: Built the model.
- Fitted the data on Linear Regression Model.
Step 7: Saved the model as pickle file to re-use it again.
Step 8: Created Web App.
The Model Analysis
import pandas as pd
import pickle

Imported essential libraries – When we import modules, we can call functions that are not built into Python. Some modules are installed as part of Python, and others are installed through pip. Using modules makes our programs more robust and powerful, since we are leveraging existing code.
df = pd.read_csv('ipl.csv')

Loaded and read the dataset – You can import tabular data from a CSV file into a pandas DataFrame by passing the file path to read_csv.
columns_to_remove = ['mid', 'venue', 'batsman', 'bowler', 'striker', 'non-striker']
df.drop(labels=columns_to_remove, axis=1, inplace=True)
consistent_teams = ['Kolkata Knight Riders', 'Chennai Super Kings', 'Rajasthan Royals',
'Mumbai Indians', 'Kings XI Punjab', 'Royal Challengers Bangalore',
'Delhi Daredevils', 'Sunrisers Hyderabad']
df = df[(df['bat_team'].isin(consistent_teams)) & (df['bowl_team'].isin(consistent_teams))]
df = df[df['overs']>=5.0]
from datetime import datetime
df['date'] = df['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

Performed Data Cleaning – Data cleaning is the process of ensuring data is correct, consistent, and usable: identifying errors or corruptions, correcting or deleting them, or manually processing the data to prevent the same errors from recurring. Cleaning is also important because it improves data quality and, in doing so, overall productivity, since outdated or incorrect information is removed. Here, I first removed unwanted columns, then kept only the consistent teams, then removed the first 5 overs of data from every match, and finally converted the 'date' column from string into a datetime object.
encoded_df = pd.get_dummies(data=df, columns=['bat_team', 'bowl_team'])
encoded_df = encoded_df[['date', 'bat_team_Chennai Super Kings', 'bat_team_Delhi Daredevils', 'bat_team_Kings XI Punjab',
'bat_team_Kolkata Knight Riders', 'bat_team_Mumbai Indians', 'bat_team_Rajasthan Royals',
'bat_team_Royal Challengers Bangalore', 'bat_team_Sunrisers Hyderabad',
'bowl_team_Chennai Super Kings', 'bowl_team_Delhi Daredevils', 'bowl_team_Kings XI Punjab',
'bowl_team_Kolkata Knight Riders', 'bowl_team_Mumbai Indians', 'bowl_team_Rajasthan Royals',
'bowl_team_Royal Challengers Bangalore', 'bowl_team_Sunrisers Hyderabad',
'overs', 'runs', 'wickets', 'runs_last_5', 'wickets_last_5', 'total']]

Performed Data Pre-processing – Data gathered from different sources arrives in a raw format that is not ready for analysis, so preprocessing is an important step in preparing the data for an ML model. It is commonly divided into four stages: data cleaning, data integration, data reduction, and data transformation. First, I converted the categorical features using one-hot encoding via the .get_dummies() function. One-hot encoding converts categorical variables into a numeric form that ML algorithms can use for better prediction, making the representation of categorical data more expressive. There are two common encoding schemes: for a categorical variable with n values, one-hot encoding creates n indicator columns, while dummy encoding creates n-1. pandas' get_dummies function converts categorical string values into dummy variables, creating a new DataFrame in which each unique value becomes a column of zeros and ones. Secondly, I re-arranged the columns.
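As a small illustration of the n vs. n-1 distinction (using a hypothetical toy column, not the project's data), get_dummies produces n indicator columns by default and n-1 with drop_first=True:

```python
import pandas as pd

# Toy frame with one categorical column holding n = 3 distinct values.
df_toy = pd.DataFrame({'team': ['CSK', 'MI', 'RCB', 'MI']})

one_hot = pd.get_dummies(df_toy, columns=['team'])                   # n columns
dummy = pd.get_dummies(df_toy, columns=['team'], drop_first=True)    # n-1 columns

print(one_hot.shape[1])  # 3 indicator columns
print(dummy.shape[1])    # 2 indicator columns
```

The dropped column is implied: a row of all zeros in the dummy encoding means the first category.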
X_train = encoded_df.drop(labels='total', axis=1)[encoded_df['date'].dt.year <= 2016]
X_test = encoded_df.drop(labels='total', axis=1)[encoded_df['date'].dt.year >= 2017]
y_train = encoded_df[encoded_df['date'].dt.year <= 2016]['total'].values
y_test = encoded_df[encoded_df['date'].dt.year >= 2017]['total'].values
X_train.drop(labels='date', axis=1, inplace=True)
X_test.drop(labels='date', axis=1, inplace=True)

Split the data – The data is split into train and test sets so that predictions can be made on X_test after training: matches up to 2016 form the training set and matches from 2017 the test set. Then, the 'date' column is removed.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)

Built the model – The fit() method takes the training data as arguments: one array in the case of unsupervised learning, or two arrays (features and targets) in the case of supervised learning. Here the data is fitted on a Linear Regression model.
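The Analysis section of this report refers to RMSE; a hedged sketch of how the fitted model could be evaluated on held-out data (shown with synthetic arrays I generated for illustration, since the real X_test/y_test come from the year-based split above) might look like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for the train/test split: a nearly linear relationship.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 3))
y_tr = X_tr @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_te = rng.normal(size=(20, 3))
y_te = X_te @ np.array([1.5, -2.0, 0.5])

reg = LinearRegression().fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, reg.predict(X_te)))
print(rmse)  # small here, because the synthetic data is almost perfectly linear
```

On the real IPL data the RMSE would of course be much larger, since match scores are far noisier than this toy relationship.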
filename = 'first-innings-score-lr-model.pkl'
pickle.dump(regressor, open(filename, 'wb'))

Saved the model – Saved as a pickle file so it can be re-used later: the trained model can be loaded directly rather than going through the full training cycle again.
from flask import Flask, render_template, request
import pickle
import numpy as np
filename = 'first-innings-score-lr-model.pkl'
regressor = pickle.load(open(filename, 'rb'))
app = Flask(__name__)
@app.route('/')
def home():
return render_template('index.html')
@app.route('/predict', methods=['POST'])
def predict():
temp_array = list()
if request.method == 'POST':
batting_team = request.form['batting-team']
if batting_team == 'Chennai Super Kings':
temp_array = temp_array + [1,0,0,0,0,0,0,0]
elif batting_team == 'Delhi Daredevils':
temp_array = temp_array + [0,1,0,0,0,0,0,0]
elif batting_team == 'Kings XI Punjab':
temp_array = temp_array + [0,0,1,0,0,0,0,0]
elif batting_team == 'Kolkata Knight Riders':
temp_array = temp_array + [0,0,0,1,0,0,0,0]
elif batting_team == 'Mumbai Indians':
temp_array = temp_array + [0,0,0,0,1,0,0,0]
elif batting_team == 'Rajasthan Royals':
temp_array = temp_array + [0,0,0,0,0,1,0,0]
elif batting_team == 'Royal Challengers Bangalore':
temp_array = temp_array + [0,0,0,0,0,0,1,0]
elif batting_team == 'Sunrisers Hyderabad':
temp_array = temp_array + [0,0,0,0,0,0,0,1]
bowling_team = request.form['bowling-team']
if bowling_team == 'Chennai Super Kings':
temp_array = temp_array + [1,0,0,0,0,0,0,0]
elif bowling_team == 'Delhi Daredevils':
temp_array = temp_array + [0,1,0,0,0,0,0,0]
elif bowling_team == 'Kings XI Punjab':
temp_array = temp_array + [0,0,1,0,0,0,0,0]
elif bowling_team == 'Kolkata Knight Riders':
temp_array = temp_array + [0,0,0,1,0,0,0,0]
elif bowling_team == 'Mumbai Indians':
temp_array = temp_array + [0,0,0,0,1,0,0,0]
elif bowling_team == 'Rajasthan Royals':
temp_array = temp_array + [0,0,0,0,0,1,0,0]
elif bowling_team == 'Royal Challengers Bangalore':
temp_array = temp_array + [0,0,0,0,0,0,1,0]
elif bowling_team == 'Sunrisers Hyderabad':
temp_array = temp_array + [0,0,0,0,0,0,0,1]
overs = float(request.form['overs'])
runs = int(request.form['runs'])
wickets = int(request.form['wickets'])
runs_in_prev_5 = int(request.form['runs_in_prev_5'])
wickets_in_prev_5 = int(request.form['wickets_in_prev_5'])
temp_array = temp_array + [overs, runs, wickets, runs_in_prev_5, wickets_in_prev_5]
data = np.array([temp_array])
my_prediction = int(regressor.predict(data)[0])
return render_template('result.html', lower_limit = my_prediction-10, upper_limit = my_prediction+5)
if __name__ == '__main__':
app.run(debug=True)

Created Web App – It is made in Flask for end-users to use.
A challenge I faced in this project is that Random Forest regression is time-consuming, so more time went into data pre-processing. In the IPL it is very difficult to predict the actual score, because in a moment the game can turn completely upside down.
Creation of App
Here, I create the Flask app: load the saved model, then collect and compute all the factors needed to decide the match score. The predict function reads the user-entered values and renders the corresponding .html page with the result. You could also build this with Streamlit.
Technical Aspect
The pandas module mainly works with tabular data through its DataFrame and Series objects. For some element-wise operations pandas can be considerably slower than NumPy, but it is seriously a game changer when it comes to cleaning, transforming, manipulating, and analyzing data.
Sklearn is short for scikit-learn. It provides many ML algorithms and utilities, offering a range of supervised and unsupervised learning algorithms via a consistent interface in Python.
Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.
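A minimal round-trip showing serialization and deserialization with pickle (the dictionary here is a hypothetical stand-in for any Python object, such as a trained model):

```python
import pickle

# Hypothetical model state to serialize.
model_state = {'coef': [1.2, -0.4], 'intercept': 3.0}

data = pickle.dumps(model_state)   # serialize the object to a byte stream
restored = pickle.loads(data)      # deserialize the byte stream back to an object

print(restored == model_state)  # True: the round trip preserves the object
```

pickle.dump/pickle.load work the same way but write to and read from a file object, which is how this project saves the regressor.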
Why we need train_test_split – Using the same dataset for both training and testing leaves room for miscalculation and increases the chance of inaccurate predictions. The train_test_split function lets you split a dataset with ease while pursuing an ideal model. Also keep in mind that your model should not be overfitting or underfitting.
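A short sketch of train_test_split on toy arrays (the 30% test fraction and random_state are illustrative choices, not this project's settings — this project splits by year instead):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))  # 7 3
```

The model is then fitted only on X_train/y_train, and X_test/y_test are kept aside to estimate performance on unseen data.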
Date and datetime are objects in Python, so when you manipulate them you are actually manipulating objects, not strings or timestamps. Working with dates and times is one of the biggest challenges in programming: between time zones, daylight saving time, and different written date formats, it can be tough to keep track of which days and times you are referencing. Fortunately, the built-in Python datetime module can help you manage this complexity. Daylight saving time is a great example of the irregularity: in the United States and Canada, clocks are set forward by one hour on the second Sunday in March and back by one hour on the first Sunday in November, but this has only been the case since 2007; before that, clocks changed on the first Sunday in April and the last Sunday in October. Things get even more complicated when you consider time zones.
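A tiny example using the same '%Y-%m-%d' format string this project uses to parse the 'date' column:

```python
from datetime import datetime

# Parse an ISO-style date string into a datetime object.
d = datetime.strptime('2017-04-05', '%Y-%m-%d')

print(d.year)   # 2017
print(d.month)  # 4
print(d.day)    # 5
```

Once parsed, attributes like .year can be compared directly, which is how the project splits matches into pre-2017 training data and 2017 test data.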
Linear regression is the next step up after correlation. It is used when we want to predict the value of one variable based on the value of another; the variable we want to predict is called the dependent variable. The model is based on the equation of a straight line, y = a + bx, where a is the y-intercept (the value of y when x = 0) and b is the slope (how much y increases when x increases by one unit). Linear regression plots a straight line through a y vs. x scatterplot, which is why it is called linear regression. Simple linear regression is a statistical method for summarizing and studying the relationship between two continuous (quantitative) variables, where one variable, denoted x, is regarded as the predictor, explanatory, or independent variable. The goal of multiple linear regression (MLR) is to model the linear relationship between several explanatory (independent) variables and the response (dependent) variable; in essence, it extends ordinary least-squares (OLS) regression to more than one explanatory variable.
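The slope b and intercept a of y = a + bx can also be computed directly from the standard OLS closed-form formulas; a small sketch on toy points made up for illustration:

```python
import numpy as np

# Toy points lying close to the line y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.0, 8.1])

# OLS slope: covariance of x and y divided by the variance of x.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# OLS intercept: the fitted line passes through the point of means.
a = y.mean() - b * x.mean()

print(round(b, 2))  # slope, approximately 2.01
print(round(a, 2))  # intercept, approximately 0.0
```

This is exactly what sklearn's LinearRegression computes for the one-feature case, generalized to many features in the multiple-regression setting.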
Installation
Using an Intel Core i5 9th Gen with an NVIDIA GeForce GTX 1650.
Windows 10 Environment Used.
Already Installed Anaconda Navigator for Python 3.x
The Code is written in Python 3.8.
If you don't have Python installed you can install Python from its official site.
If you are using a lower version of Python, first make sure pip is up to date: run python -m pip install --upgrade pip and press Enter.
Run-How to Use-Steps
Keep your internet connection on while running or accessing the files, and throughout the process.
Follow this when you want to perform from scratch.
Open Anaconda Prompt, Perform the following steps:
cd
pip install pandas
pip install scikit-learn
pip install numpy
pip install scipy
pip install matplotlib
Note: If it shows an error such as 'No module named ...', install the relevant module.
You can also create a requirements.txt file with pip freeze > requirements.txt
Create Virtual Environment:
conda create -n ipl python=3.6
y
conda activate ipl
cd
Run the .py files.
Paste the URL into your browser to check whether it is working locally.
Follow this when you want to just perform on local machine.
Download ZIP File.
Right-click on the ZIP file in your downloads section and select the Extract option, which will unzip the file.
Move the unzipped folder to the desired location, be it the D drive, desktop, etc.
Open Anaconda Prompt, write cd
eg: cd C:\Users\Monica\Desktop\Projects\Python Projects 1\ 23)End_To_End_Projects\ Project_6_ML_FileUse_EndToEnd_Cricket_IPL_FirstInningsScorePrediction\Project_ML_CricketScorePrediction
conda create -n ipl python=3.7
y
conda activate ipl
In Anaconda Prompt, run pip install -r requirements.txt to install all packages.
In Anaconda Prompt, write python app.py and press Enter.
Paste the URL into your browser to check whether it is working locally.
Please be careful with spellings and numbers while typing the filename; it is easier to just copy the filename and then run it, to avoid silly errors.
Note: cd
[Go to the folder where the file is, select the path from the top, right-click and choose Copy, then paste it after cd, separated by one space.]
Directory Tree-Structure of Project
To Do-Future Scope
Can create app for tennis.
Add columns in dataset of top batsmen and bowlers of all the teams.
Add columns that consists of striker and non-striker's strike rates.
Can deploy on AWS.
Technologies Used/System Requirements/Tech Stack
Conclusion
Modeling
Using Random Forest (RF) regression may give better results.
Analysis
As you can see, the RMSE is not extremely high; however, it could still be reduced using Random Forest.
Credits
Krish Naik Channel
https://machinelearningmastery.com/linear-regression-for-machine-learning/